Research Synthesis Methods

Wiley

Preprints posted in the last 7 days, ranked by how well they match Research Synthesis Methods's content profile, based on 20 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models

Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.

2026-06-04 health informatics 10.64898/2026.06.03.26354854 medRxiv
Top 0.1%
14.5%

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take uncertainty into account and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss a design as well as those that exhibit it. Materials and Methods: Therefore, we carried out a limited evaluation of the TM that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs: cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few (0-21%) of cases.
The most confident predictions of the TM were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, for which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
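The selection step described in this abstract (keeping only the most confident model scores that disagree with the NLM label) can be sketched roughly as follows. This is an illustration only, not the authors' code: the `Article` record, its field names, and the helper `confident_disagreements` are hypothetical, and the default of 100 articles per side is taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Article:
    pmid: str        # hypothetical identifier field
    score: float     # model's predictive score, assumed to lie in [0, 1]
    nlm_label: bool  # True if NLM assigned the study design


def confident_disagreements(articles, k=100):
    """Return (top, bottom).

    top:    the k highest-scoring articles that NLM did NOT tag.
    bottom: the k lowest-scoring articles that NLM DID tag.
    """
    untagged = [a for a in articles if not a.nlm_label]
    tagged = [a for a in articles if a.nlm_label]
    top = sorted(untagged, key=lambda a: a.score, reverse=True)[:k]
    bottom = sorted(tagged, key=lambda a: a.score)[:k]
    return top, bottom


# Toy data, invented purely for the example.
demo = [
    Article("a", 0.99, False),
    Article("b", 0.97, True),
    Article("c", 0.02, True),
    Article("d", 0.01, False),
    Article("e", 0.60, False),
]
top, bottom = confident_disagreements(demo, k=1)
```

Both returned lists would then go to human annotators, since by construction every article in them contradicts the existing NLM label.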

2
Positioning Early Phase CNS Trials for Regulatory and Investor Success: Strategic Implications of the Single Phase 3 Approval Paradigm

Schmidt, P.; Preskorn, S.

2026-06-08 neurology 10.64898/2026.06.05.26353604 medRxiv
Top 0.2%
0.8%

In February 2026, the FDA announced that a single pivotal phase 3 (P3) trial would become the new default standard for drug approval - a regulatory direction that had been legally enabled since the FDA Modernization Act of 1997. This announcement has strategic, scientific, and economic implications for drug developers, contract research organizations (CROs), and biotech investors. We argue that the expansion of this framework, originally reserved for various niche submissions, represents a paradigm change: it dramatically increases the value of rigorous early phase (P1 and P2) trial design and requires sponsors to establish both statistical efficacy signals and mechanistic biological understanding before entering phase 3. Using a CNS indication cost model, we show that single P3 approval can reduce total development expenditure from approximately $447 million over 14 years to $297 million over 12 years - a savings of $150 million that also provides two years of additional commercial runway for a modeled CNS drug. Case examples including lecanemab, omaveloxolone, and tofersen illustrate how biomarker-informed early phase strategies can establish the confirmatory evidence necessary for single-trial approval. We provide practical guidance for maximizing the value of P1 and P2 under this evolving framework.

3
Large Language Models in Healthcare Simulation Education: A Bibliometric Analysis with AI-Assisted Screening

Pears, M.; Wadhwa, K.; Payne, S. R.; Konstantinidis, S. T. H.; Biyani, C. S.

2026-06-04 urology 10.64898/2026.06.02.26354722 medRxiv
Top 0.4%
0.2%

Large language models (LLMs) such as ChatGPT are rapidly reshaping healthcare education and simulation-based training in non-technical skills (NTS), yet no bibliometric analysis has mapped this landscape. We searched seven open-access databases (OpenAlex, PubMed, Europe PMC, Crossref, Semantic Scholar, CORE, DOAJ) for English-language publications from January 2020 to March 2026. From 100,277 initial records, a sequential keyword funnel yielded 830 candidate papers, which were screened by 83 independent Claude Sonnet 4.6 AI agents applying pre-specified inclusion criteria (PRISMA-trAIce compliant; Cohen's kappa = 0.86 pre-reconciliation, 1.0 post-reconciliation). The final AI-verified corpus comprised 551 papers with a compound annual growth rate of 109%, contributions from 2,398 authors across 279 journals in 58 countries, and an h-index of 41. ChatGPT dominated the model landscape (46% of papers), with open-source models virtually absent. Virtual patient chatbots were the leading simulation modality (106 papers). Among NTS domains, communication (145 papers) and decision-making (135 papers) were most studied, whereas teamwork, leadership, situational awareness, and crisis resource management were markedly underrepresented. Only 6 urology-relevant papers were identified, none examining LLM integration within boot camp training formats. The field is growing at extraordinary pace but remains concentrated in a narrow range of NTS domains and a single proprietary model. Critical gaps persist in team-based skills training, open-source model evaluation, and specialty-specific simulation. AI-assisted bibliometric screening using multiple independent agents is feasible, reliable, and scalable, offering a replicable methodology for mapping fast-evolving research fields.
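The agreement statistic quoted in this abstract, Cohen's kappa, compares observed agreement between two screeners with the agreement expected by chance. A minimal sketch is shown below; the function name and the toy include/exclude decisions are invented for illustration, and the abstract does not say how the authors computed their value.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy screening decisions (hypothetical).
screener_1 = ["include", "include", "exclude", "exclude"]
screener_2 = ["include", "exclude", "exclude", "exclude"]
kappa = cohens_kappa(screener_1, screener_2)
```

In this toy case the raters agree on 3 of 4 items (0.75) against a chance agreement of 0.5, so kappa is 0.5.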

4
A New Mixed Frequency Regression Model For Environmental Epidemiology

Shukla, N.; Bartington, S. E.; Hansell, A. L.; Lucas, T. C.

2026-06-04 epidemiology 10.64898/2026.06.03.26354801 medRxiv
Top 0.6%
0.1%

Background: In the absence of high-resolution response data, exposure-response modelling often relies on aggregated low-frequency exposure data, leading to loss of high-resolution information. Mixed Data Sampling (MIDAS) from econometrics offers an alternative but is limited due to its inability to make high-resolution predictions, inflexible likelihoods and penalised nonlinear functions, and limited visualization options. We propose a mixed-frequency Distributed Lag Non-linear Model (mf-DLNM) which can eliminate the need to aggregate exposure data in environmental epidemiology and provide high resolution predictions for time series studies. Methods: We evaluated the inference and predictive performance of the mf-DLNM. To evaluate its ability to estimate exposure-response relationships, we applied mf-DLNM and same-frequency (sf)-DLNM using data from the West Midlands, UK. Additionally, we compared the predictive performance of mf-DLNM with sf-DLNM and MIDAS across nine regions of England. As MIDAS cannot predict at the resolution of the predictor (daily), we compared the predictive performance of mf-DLNM and MIDAS at weekly resolution. To test the model's ability to predict high temporal resolution risk (daily), we compared sf-DLNM (with access to daily mortality counts) with mf-DLNM (with access only to weekly mortality counts). Results: In the West Midlands example, mf-DLNM performed comparably to sf-DLNM in estimating daily risk of temperature on respiratory mortality. Furthermore, mf-DLNM and MIDAS exhibited similar performance for weekly predictions. For high-resolution predictions, mf-DLNM and sf-DLNM showed nearly similar performance, despite mf-DLNM having access only to low-resolution response data. 
Conclusion: This mixed-frequency approach in environmental epidemiology overcomes the limitations of predicting health risks using aggregated exposure data and provides estimates of high-resolution outcomes in the absence of high-frequency health outcome datasets.

5
Clinician-Centered Evaluation of Large Language Model-Generated Discharge Summaries for Longer Hospitalizations: Insights from Hospitalists and Primary Care Physicians

Osborne, T.; Mahmud, T.; Zheng, X.; Jampala, S.; Abbasi, S.; Hong, S.; Kranz, K.; Lee, S.; Ng, P.; Odekon, K.; Schachter, L.; Sexton, R.; Spinnato, T.; Tharakan, M.; Wu, Z.; Wang, F.; Wong, R.

2026-06-05 health systems and quality improvement 10.64898/2026.06.03.26354858 medRxiv
Top 1%
0.0%

Although large language models (LLMs) have shown promise for discharge summary generation, their value may be greater in longer hospitalizations, where increasing documentation volume and complexity increase both clinician burden and the risk of communication failures during transitions of care. Prior evaluations of LLM-generated discharge summaries have largely involved shorter stays and have rarely examined receiving-clinician priorities or incidental finding reporting. We compared LLM-generated and human-authored discharge summaries for 60 Internal Medicine hospitalizations lasting 7 to 21 days, with paired assessment by hospitalists and primary care physicians (PCPs). Clinician reviewers preferred LLM-generated summaries for 95% of encounters and rated them higher for quality, readability, factuality and completeness. PCPs, the primary recipients responsible for post-discharge care, found that LLM-generated summaries were better for understanding and communicating hospital care to patients, and providing follow-up care. LLM-generated summaries had fewer annotated errors, primarily due to fewer omissions, without increased estimated harm potential or likelihood compared with human-authored summaries. Benefits of LLM-generated summaries were especially salient for PCPs, who identified more omissions with greater downstream likelihood of harm than hospitalists. This underscores the importance of designing transition documents around the needs of clinicians assuming care post-discharge. LLM identification of radiology incidental findings was generally accurate and appropriate, suggesting potential to improve follow-up of clinically relevant findings. These findings extend prior work by demonstrating clinical value of LLMs in summarizing longer, complex hospitalizations and highlighting the value of stakeholder-centered design in clinical AI systems. 
Together, they support supervised LLM-assisted discharge summarization as a tool to reduce cognitive burden, improve documentation quality, and enhance transition-of-care communication.

6
Prevalence and factors associated with peripheral artery disease among patients with diabetes mellitus: A cross-sectional study at tertiary hospital in Eastern Uganda

Imalingat, J.; Muyinda, A.; Iraguha, D.; Katuramu, R.; Masaba, P.; Apio, E.; Kebesu, J.; Nankunda, O.; Kirabo, E.; Epuitai, J.; Bwayo, D.

2026-06-05 cardiovascular medicine 10.64898/2026.06.03.26354843 medRxiv
Top 1%
0.0%

Background: Peripheral artery disease (PAD) is a major contributor to morbidity and mortality, particularly among individuals with diabetes mellitus (DM), in whom its prevalence is markedly increased. PAD is often asymptomatic and under-diagnosed, especially in low-resource settings. This study aimed to determine the prevalence of PAD and associated factors among adults with DM in Eastern Uganda. Methods: We conducted a hospital-based cross-sectional study at Mbale Regional Referral Hospital from 10 December 2024 to 30 April 2025. A total of 300 adult patients with DM were consecutively enrolled. Data on sociodemographic characteristics, clinical characteristics, comorbidities, and behavioural risk factors were collected using an interviewer-administered data tool. PAD was assessed using the ankle-brachial index (ABI) and defined as an ABI ≤ 0.90. Modified Poisson regression was used to identify factors associated with PAD. As a secondary measure of PAD, we administered the Edinburgh Claudication Questionnaire (ECQ) to capture symptomatic PAD. Results: The majority of the participants had low fruit intake (68%), physical inactivity (54%), and elevated low-density lipoprotein (60%). The prevalence of PAD as measured by ABI was 42.3% (127/300; 95% CI 0.38-0.48), while the magnitude of PAD as measured by ECQ, combining participants with possible and definite claudication, was 37.3% (95% CI 31.9-42.8). Of the participants with PAD, 15.8% (20/127) were classified as having severe PAD (ABI <0.4). Socio-demographic and clinical factors were assessed for association with PAD. We found no evidence of association between PAD and the examined factors, such as age (aPR 1.24, 95% CI 0.73-2.09), sex (aPR 1.46, 95% CI 0.84-2.55), cholesterol level (aPR 1.39, 95% CI 0.86-2.25), glycemic control (aPR 1.35, 95% CI 0.72-2.53), and sedentary behaviour (aPR 1.28, 95% CI 0.79-2.08). Conclusion: The prevalence of PAD was high among adults with DM in Eastern Uganda.
Routine health education and ABI screening for PAD should be offered to patients living with DM. The absence of significant associations despite the high prevalence of PAD may reflect unmeasured factors (e.g. chronic inflammation) that may be unique to this population. Future prospective studies with larger sample sizes and more detailed objective measures (e.g. inflammatory markers) are needed to determine locally relevant modifiable risk factors.
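The ABI cut-offs quoted in this abstract (PAD at an ABI of 0.90 or below, severe PAD below 0.4) and the headline prevalence can be illustrated with a short sketch. The function names are hypothetical, and the category labels simply restate the abstract's thresholds; any further ABI categories a clinical guideline might define are deliberately left out.

```python
def classify_abi(abi):
    """Classify one ankle-brachial index value using the abstract's cut-offs."""
    if abi < 0.4:
        return "severe PAD"   # abstract: severe PAD is ABI < 0.4
    if abi <= 0.90:
        return "PAD"          # abstract: PAD is ABI <= 0.90
    return "no PAD"


def prevalence_percent(cases, total):
    """Crude prevalence as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * cases / total


# Counts reported in the abstract: 127 of 300 participants had PAD by ABI.
pad_prevalence = round(prevalence_percent(127, 300), 1)
```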

7
Metatranscriptomics-Derived Disease Risk Scores as a Preventive, Diagnostic, and Treatment Support Tool

Hu, L.; Bass, M.; Patridge, E.; Molusky, M.; Antoine, G.; Vuyisich, M.; Banavar, G.

2026-06-06 genetic and genomic medicine 10.64898/2026.05.29.26354333 medRxiv
Top 1%
0.0%

Background: Chronic diseases and symptom syndromes often develop after prolonged biological changes that may precede formal diagnosis. RNA-based metatranscriptomics captures active microbial and human gene expression and may provide a functional layer for disease risk evaluation. To address this translational gap, we developed and validated a Disease Risk Score (DRS) framework that integrates metatranscriptome-derived pathway activity scores from stool, saliva, and blood samples, and evaluated its potential clinical utility as an adjunct risk-evaluation tool. Methods: DRS uses disease-specific sets of pathway activity scores derived from stool and saliva microbial functions, stool and saliva microbial taxa, and blood human gene expression. For each disease, 'not optimal' pathway scores are aggregated into a normalized cumulative odds ratio, or cOR, using score-level odds ratios, statistical significance, and literature-supported biological relevance derived from a Development Cohort of 22,369 individuals. A cOR ≥ 5 is defined as high risk. Performance is evaluated in an independent Validation Cohort of 15,908 individuals using self-reported diseases as the reference. Disease support requires both significant cOR separation between self-reported and not-reported (Cohen's d ≥ 0.2) and risk ratio enrichment of self-reported disease among individuals classified as high risk (95% CI of Risk Ratio > 1). Results: Of 20 initially evaluated diseases, 15 meet the prespecified validation criteria on the independent validation cohort: ADHD, anxiety, chronic fatigue syndrome, depression, GERD, hypertension, inflammatory bowel disease, IBS-C, IBS-D, insomnia, MASLD, obesity, obstructive sleep apnea, Sjogren's syndrome, and type 2 diabetes.
Five selected clinical scenarios illustrate how DRS can support clinician-mediated decision making, including IBS subtype reclassification, improved diagnostic acceptance in IBS-D, personalized lifestyle counseling in MASLD and early type 2 diabetes, and diagnostic uncertainty in atypical GERD. Conclusions: DRS is a metatranscriptomics-based risk-stratification framework that aggregates active microbial and human pathway signals into interpretable disease-specific risk estimates across a wide range of disease conditions. Validation against self-reported disease labels in an independent cohort shows significant risk enrichment for each of 15 diseases. DRS is intended as an adjunct to clinical evaluation: a decision support tool in situations where routine care encounters uncertainty, delay, or low patient engagement. Future prospective studies using clinically adjudicated endpoints are needed to assess calibration and clinical outcomes.
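One of the validation criteria in this abstract is a risk ratio whose 95% confidence interval lies above 1 for individuals classified as high risk (cOR of 5 or more). A minimal sketch of that check is given below, using the standard log-scale (Katz) interval. The function names and the counts are invented for illustration, and the abstract does not state which interval method the authors used.

```python
import math


def is_high_risk(c_or, threshold=5.0):
    """High-risk flag; the threshold of 5 is the cut-off stated in the abstract."""
    return c_or >= threshold


def risk_ratio_ci(cases_high, n_high, cases_other, n_other, z=1.96):
    """Risk ratio of disease in the high-risk group versus everyone else,
    with a log-scale (Katz) confidence interval."""
    rr = (cases_high / n_high) / (cases_other / n_other)
    se_log = math.sqrt(
        1.0 / cases_high - 1.0 / n_high + 1.0 / cases_other - 1.0 / n_other
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30/100 high-risk individuals report the disease
# versus 10/100 of the remaining individuals.
rr, lower, upper = risk_ratio_ci(30, 100, 10, 100)
enriched = lower > 1.0  # the abstract's criterion: 95% CI of the risk ratio > 1
```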

8
Effect of levodopa treatment on gait in older adults with mild parkinsonian signs

Pongmala, C.; Roytman, S.; van Emde Boas, M.; Vangel, R.; Rosano, C.; Bohnen, N.

2026-06-06 geriatric medicine 10.64898/2026.06.04.26354926 medRxiv
Top 1%
0.0%

Background: Slow walking in older adults with mild parkinsonian signs (MPS) is a complex, multifactorial phenomenon arising from the cumulative burden of subclinical age-associated pathologies. This decline reflects age-associated neuronal loss in the dopaminergic system. A recent study suggests that levodopa treatment may enhance gait parameters. The goal of this small pilot study is to explore the effect of levodopa treatment on slow walking gait in older adults with MPS. Method: This study was a randomized, placebo-controlled clinical pilot trial. Slow walking older adults without clinical evidence of PD were recruited and randomized into 2 groups (active treatment group or placebo control group). Participants in the active group were pre-treated with carbidopa for three days, followed by carbidopa-levodopa for seven days. Spatiotemporal gait parameters were evaluated at baseline and post-intervention. Results: Gait factor analysis identified three main factors explaining gait characteristics at baseline, which included gait efficiency, gait rhythmicity, and gait turning. No effect of treatment was observed in the placebo group (p=0.111, p=0.616), no group difference was observed between the placebo and active group at baseline (β=0.310, p=0.547), but a strong trend for a treatment-related increase was observed in the active treatment group (β=0.506, p=0.076). Conclusion: Our preliminary data suggest that sustained levodopa treatment (one week) in conjunction with carbidopa pre-treatment and concomitant carbidopa supplementation is feasible in slow walking older adults with MPS. Moreover, the data indicate potential efficacy, showing improvements in cadence and step durations.

9
An AI-assisted feasibility evaluation of three photoplethysmography-derived microvascular reactivity signals in MIMIC-IV-WDB v0.1.0

Landry, T. C.; Kim, Y.

2026-06-06 health informatics 10.64898/2026.06.03.26354863 medRxiv
Top 1%
0.0%

Background. Capillary refill time, an examiner-dependent bedside test of distal microvascular perfusion, has become a resuscitation target in septic shock [1-4], motivating a continuous surrogate computed from the photoplethysmogram (PPG, the optical waveform the pulse oximeter on every ICU patient already records) [5-8]. Objective. We attempted three PPG-derived candidate measures on the MIMIC-IV Waveform Database (MIMIC-IV-WDB v0.1.0) and asked, by inspecting randomly drawn examples, whether each captured its intended physiology before any downstream modeling. Methods. MIMIC-IV-WDB v0.1.0 [9] was linked to MIMIC-IV [10]. The signals were a cuff-anchored perfusion-index recovery (reactive hyperemia when the cuff shares an arm with the probe), a slow Mayer-wave-band power ratio of the perfusion index (sympathetic vasomotor tone), and a per-beat diastolic exponential decay time constant (a refill-like recovery time). For each signal we drew 10 random examples at a fixed seed and checked them against a checklist fixed in advance. Each was read by the author and, separately, by MedGemma 1.5, a multimodal medical language model run locally. A synthetic test with a known time constant checked the third signal. Results. The cuff-anchored signal showed the expected occlusion-reperfusion shape on 268 of 6,236 evaluable cuff cycles (4.30%) in 15 of 19 patients, consistent with opposite-limb placement of the probe and cuff. The slow-band ratio returned a stable cohort value, but a clear, stationary peak appeared in only 4 of 10 random windows. The per-beat fit met its goodness-of-fit threshold in 10 of 10 beats, yet a cardiac-frequency heuristic flagged a possible fit on the heart-rate oscillation in 7 of 10, and in 5 of 17 patients the time constant lay where an exponential is indistinguishable from a straight line. A 0.5 Hz high-pass pre-filter implanted its own approximately 318 ms time constant regardless of truth.
The language model tracked the human on clear positives but reported the pattern present on every call it returned, never absent. Conclusions. Two of the three candidate signals did not reflect their intended physiology in most examples, and the third was constrained by sensor placement. Inspecting a few random raw inputs against a checklist written in advance is an inexpensive upstream check before downstream inference on PPG-derived microvascular signals.
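The figure of roughly 318 ms quoted for the 0.5 Hz high-pass pre-filter matches the time constant of a first-order filter, tau = 1 / (2 * pi * f_c). The sketch below reproduces that arithmetic. It assumes a first-order (single-pole) response, which the abstract does not state explicitly, and the function name is hypothetical.

```python
import math


def first_order_time_constant(cutoff_hz):
    """Time constant in seconds of a first-order filter with the given
    -3 dB cutoff frequency: tau = 1 / (2 * pi * f_c)."""
    if cutoff_hz <= 0:
        raise ValueError("cutoff frequency must be positive")
    return 1.0 / (2.0 * math.pi * cutoff_hz)


# 0.5 Hz is the pre-filter cutoff reported in the abstract.
tau_ms = first_order_time_constant(0.5) * 1000.0
```

Under this assumption, any exponential fit applied after such a pre-filter can recover the filter's own decay rather than the physiology, which is the artifact the abstract describes.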

10
From Charting Burden to Workflow Signal: Retrospective Validation of Documentation-Density Measures for ICU Complexity and Long-Stay Risk

Collier, A.

2026-06-06 health informatics 10.64898/2026.06.04.26354922 medRxiv
Top 1%
0.0%

Background Electronic health record documentation patterns may reflect workflow complexity, monitoring intensity, and operational strain in intensive care settings. However, documentation-derived features can be sensitive to local documentation culture, data capture systems, and outcome definitions. Retrospective validation across multiple datasets is therefore needed before these signals are used in workflow intelligence or clinical AI governance tools. Objective To evaluate whether documentation-density and documentation-timing features show reproducible retrospective signal for ICU workflow complexity and long-stay proxy outcomes across de-identified critical care datasets, while distinguishing workflow and long-stay associations from unsupported claims about mortality prediction, burden reduction, or deployment readiness. Methods We synthesized retrospective validation results from de-identified ICU and workflow datasets generated through a prespecified documentation-density validation program. Feature families included Documentation Burden Score style features, Shift-End Documentation Rate style features, documentation reliability style metadata, and all-documentation feature sets where available. Outcomes included long ICU length of stay proxies, mortality where available, and workflow proxy endpoints. Models compared baseline feature sets with enhanced models containing documentation-density or workflow features. Performance was summarized using area under the receiver operating characteristic curve, Brier score where reported, delta AUROC, bootstrap confidence intervals where reported, and label-shuffle controls where available. Results The strongest external long-stay proxy evidence came from the NWICU chartevents analysis, which included 28,612 ICU stays, 20,267 stays with chart events, and 9,619,759 chart events. For ICU length of stay greater than the median, baseline AUROC was 0.5252. 
Enhanced AUROC was 0.9512 for Documentation Burden Score features, 0.9214 for Shift-End Documentation Rate features, 0.8470 for documentation reliability style features, and 0.9517 for all documentation features. Corresponding label-shuffle enhanced AUROCs were near random, ranging from 0.4897 to 0.5064. For ICU length of stay greater than the 75th percentile, baseline AUROC was 0.5155. Enhanced AUROC was 0.9433 for Documentation Burden Score features, 0.9194 for Shift-End Documentation Rate features, 0.8118 for documentation reliability style features, and 0.9427 for all documentation features, with label-shuffle enhanced AUROCs from 0.4836 to 0.4999. Additional retrospective support was observed in eICU workflow analyses, HiRID first-24-hour documentation-density analyses, MIMIC-IV HF ICU internal analyses, MIMIC-IV-Note metadata extensions, and nursing-chart or lab density proxy analyses. However, cross-institution discrimination transfer was weak without recalibration, and several analyses remained proxy validations rather than final clinical validations. Conclusions Documentation-density and documentation-timing features show promising retrospective signal for ICU workflow complexity and long-stay proxy outcomes, especially in NWICU chartevents and selected internal dataset-specific analyses. These findings support further preregistered, prospective, silent-mode validation of documentation-derived workflow intelligence. They do not establish prospective clinical performance, mortality reduction, clinician burden reduction, autonomous deterioration prediction, or deployment readiness.
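The AUROC figures and label-shuffle controls reported in this abstract can be illustrated with a small, dependency-free sketch. AUROC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The control shown is a simplification that permutes labels against fixed scores; the abstract does not describe how its label-shuffle controls were run, and a fuller version would retrain the model on shuffled labels. All names and data are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def shuffled_label_auroc(labels, scores, seed=0):
    """Simplified control: AUROC of the same scores against permuted labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return auroc(shuffled, scores)


# Toy example (invented data): 1 = long stay, 0 = not.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
observed = auroc(labels, scores)
control = shuffled_label_auroc(labels, scores, seed=0)
```

Averaged over many permutations, a control of this kind should sit near 0.5; a single shuffle of a tiny sample, as here, can land anywhere.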

11
BodyMAE: A Surface-Area Aware Masked Autoencoder for Body Composition Estimation from 3D Body Scans

Zheng, Y.; Feng, B.; Cheng, R.; Qiu, C.; Long, Z.; Vaziri, K.; Hahn, J.

2026-06-06 health informatics 10.64898/2026.06.04.26354925 medRxiv
Top 1%
0.0%

Accurate assessment of body composition is important to risk stratification and management of metabolic, musculoskeletal, and aging-related diseases, yet reference modalities such as dual-energy X-ray absorptiometry (DXA) are costly and impractical for frequent monitoring. Commodity 3D body scans offer a low-cost, radiation-free alternative, but extracting meaningful and predictive shape features from scans remains challenging due to nonuniform point density, variable body size, and cross-device differences. We introduce BodyMAE, a self-supervised, surface-area aware masked autoencoder for metric-scale 3D body scans. The pipeline integrates area-adjusted sampling, a long-range focused encoder, and a lightweight decoder regularized to promote locally uniform reconstructions. Trained and evaluated on 917 3D body scans paired with clinical DXA reports, BodyMAE achieves strong accuracy on fat percentage (root-mean-square error (RMSE) 3.825 percentage points, R^2 0.908), fat mass (RMSE 3.694 kg, R^2 0.968), and lean mass (RMSE 3.608 kg, R^2 0.901), with competitive performance on bone mineral content (RMSE 0.284 kg, R^2 0.754). We also assess feature stability across pretrained baselines, finding higher retrieval accuracy for our representations (Top-1 90.131%). These results indicate that combining metric-aware sampling, long-range relational encoding, and local geometric regularization enables accurate body composition estimation from 3D body scans, as validated by comparisons to DXA-derived measurements.
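The two accuracy metrics reported in this abstract, RMSE and R^2, can be computed as in the sketch below. The function names and the toy fat-mass values are invented for illustration and have no connection to the paper's data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference (DXA-like) and predicted fat mass in kg.
reference = [20.0, 30.0, 40.0]
predicted = [22.0, 29.0, 41.0]
error_kg = rmse(reference, predicted)
fit = r_squared(reference, predicted)
```

RMSE stays in the units of the measurement (kg or percentage points in the abstract), whereas R^2 is unitless, which is why the abstract reports both.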

12
Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare

Schwoebel, J.; Semenec, I.; Rousseva, J.; Frasch, M. G.; Thorstenson, R.; Bhatt, M.

2026-06-06 health systems and quality improvement 10.64898/2026.06.04.26354950 medRxiv
Top 1%
0.0%

Large language models embedded in autonomous agents process trusted instructions and untrusted data in one context window, leaving them open to direct and indirect prompt injection. In healthcare this is not hypothetical: a 2025 JAMA Network Open study found commercial medical LLMs followed injected instructions in 94.4% of simulated patient encounters, including life-threatening recommendations. Yet the clinically decisive problem we quantify here is different. Most real clinical threats, including protected health information (PHI) exfiltration, cross-patient access, bulk export, and out-of-scope advice, are fluent, legitimate-looking requests that carry no attack signal, so even a state-of-the-art injection detector passes them. Existing runtime guardrails trade safety against latency: model-based auditors are accurate but add hundreds of milliseconds of Python inference, while lexical filters are fast but blind to obfuscated or semantically disguised payloads. We present QFIRE, an inline, provider-agnostic prompt firewall implemented as a single self-contained Rust toolchain: proxy, CLI, and benchmark harness. QFIRE combines three mechanisms: (i) positive-security scope constraints, which restrict a model call to a declared natural-language purpose and block out-of-scope drift even when no overt attack token is present; (ii) an asynchronous detector graph that runs N rules and their detector nodes concurrently, cheapest checks first; and (iii) a de-obfuscation pass that decodes Base64, hex, and ROT13, folds homoglyphs and leetspeak, and strips zero-width characters before detection. QFIRE ships 106 versioned firewall rules and a dedicated HIPAA Safe Harbor 18-identifier PHI panel, and runs a local DeBERTa v3 injection classifier via embedded ONNX Runtime.
On 1,968 public prompt-injection and jailbreak prompts, QFIRE's deterministic hybrid attains F1 0.86, statistically tied with Meta's state-of-the-art PromptGuard 2 (0.86) and above protectai DeBERTa v3 (0.83); lexical baselines lag at 0.16 to 0.50. Our central result is on QFIRE HealthBench, a new 2,000-prompt healthcare benchmark we build and release with real garak and Microsoft PyRIT payloads. There the same PromptGuard 2 recovers only 0.40 recall (DeBERTa v3: 0.57), because most clinical threats carry no injection signal; QFIRE's combined scope-plus-PHI chain reaches 0.83 recall (F1 0.87) at a calibrated 0.08 false-positive rate. Generic injection detection, even state-of-the-art, is therefore necessary but not sufficient for healthcare agents. A bare LLM judge also closes most of this static-corpus gap (F1 0.90); QFIRE's contribution beyond static accuracy is auditable determinism, bounded latency, and adaptive robustness, where the bare judge falls to 34 to 59% recall (section 5.5). End to end, placing QFIRE in front of a tool-using agent over a mock EHR sandbox cuts the agent's harmful-action rate from 0.38 to 0.00 at a 0.13 benign-utility cost. All code, rules, corpora snapshots, and scripts are released, and every table regenerates from a single make paper target against local models with no paid API keys.
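The de-obfuscation pass described in this abstract (decoding encoded payloads and stripping zero-width characters before detection) can be illustrated with a short sketch. This is not QFIRE's code: QFIRE is described as a Rust toolchain, whereas this example is written in Python for illustration, all function names are hypothetical, and it covers only Base64, ROT13, and zero-width stripping, leaving out hex decoding and homoglyph or leetspeak folding.

```python
import base64
import binascii
import codecs

# A few common zero-width / invisible code points (not an exhaustive list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text):
    """Remove zero-width characters that can split a keyword invisibly."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def try_base64(text):
    """Return the decoded UTF-8 text if `text` is valid Base64, else None."""
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def candidate_views(text):
    """Produce the alternative readings a detector could then scan."""
    cleaned = strip_zero_width(text)
    views = [cleaned, codecs.decode(cleaned, "rot_13")]
    decoded = try_base64(cleaned)
    if decoded is not None:
        views.append(decoded)
    return views


hidden = "ig\u200bnore"                                # zero-width space inside a word
rotated = codecs.encode("ignore", "rot_13")            # ROT13-disguised keyword
encoded = base64.b64encode(b"ignore").decode("ascii")  # Base64-disguised keyword
```

A detector would scan every view returned by `candidate_views`, so a keyword hidden by any one of these tricks becomes visible to the same rule that matches it in plain text.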

13
Universal Periodic Review recommendations and trajectories of maternal health between 2005 and 2023: a longitudinal ecological analysis of 89 countries

Uppal, A.; Thomas, R.; De Pasquale, M.; Sillo, J.; Getahun, H.

2026-06-05 public and global health 10.64898/2026.06.03.26354800 medRxiv
Top 1%
0.0%

Background: The Universal Periodic Review (UPR) is a peer-review mechanism established to hold UN Member States accountable for human rights including the right to health, yet evidence on its impact on health outcomes is limited. We evaluated whether UPR engagement is associated with accelerated improvements in maternal health trajectories. Methods and Findings: We conducted a longitudinal ecological analysis of 89 countries with a baseline maternal mortality ratio (MMR) of 70 or greater per 100,000 live births in 2005. Outcomes were trajectories of annual MMR, skilled birth attendance (SBA), and contraceptive prevalence rate (CPR), from 2005 to 2023. The exposure was the volume of health-related UPR recommendations received across three cycles, thematically classified using a validated rule-based algorithm. Mixed-effects models adjusted for time-varying GDP per capita and historical fragility. The 89 countries received 41,733 UPR recommendations across three cycles, of which 405 (1%) were related to maternal health. Maternal health recommendations were preferentially directed at countries with higher baseline MMR and lower SBA. After adjustment, each additional maternal health recommendation was associated with a 0.24% [95% confidence interval (CI): 0.08, 0.40] faster annual reduction in MMR, a 0.52% [0.12, 0.91] faster annual gain in the odds of SBA, and a 0.21% [0.09, 0.34] faster annual gain in the odds of CPR. Broader recommendations on women's health and health systems and services were also associated with faster annual improvements in trajectories across all three outcomes; recommendations on abortion, family planning, sexual health and wellbeing, and sexual education tended to be directed towards lower-burden countries and were not associated with differences in any trajectories. It is important to note that the ecological design precludes causal inference. 
Conclusions: Receiving UPR recommendations on the themes of maternal health, women's health, and health systems and services is associated with accelerated improvements in maternal health trajectories among high-burden countries. These findings suggest that international human rights accountability mechanisms may have a role in supporting national progress on maternal health.

14
AutoClip: AI-Guided TEE Semantic Segmentation for TEER: A Proof-of-Concept Study

Chen, M.; Li, X.; Yang, K.; Taramasso, M.

2026-06-06 cardiovascular medicine 10.64898/2026.05.29.26354195 medRxiv
Top 1%
0.0%

Background: Transcatheter edge-to-edge repair (TEER) is an established treatment for mitral regurgitation but remains highly dependent on operator experience and complex transesophageal echocardiography (TEE)-guided intraprocedural imaging. Artificial intelligence (AI)-based semantic segmentation may improve procedural reproducibility and intraprocedural guidance; however, no TEER-specific segmentation framework has been reported. Objectives: To develop and evaluate AutoClip, a clinician-driven AI-guided TEE semantic segmentation model designed for simultaneous delineation of mitral valve anatomy and in-vivo TEER device components. Methods: A retrospective proof-of-concept study was conducted using 987 intraprocedural TEE frames derived from 10 video clips in 3 patients undergoing MitraClip G4 implantation. Seven semantic labels, including mitral leaflets and device components, were manually annotated using ITK-SNAP. Following standardized preprocessing and region-of-interest extraction, an Attention U-Net architecture was trained frame-wise on bicommissural and corresponding X-plane TEE views. Model performance was assessed using mean intersection-over-union (IoU) and Dice coefficient on an independent test set. Results: The Attention U-Net demonstrated improved sensitivity to small device structures compared with conventional U-Net architectures. Preliminary training performance achieved a mean IoU of approximately 0.93, while independent test performance reached a mean IoU of 0.46 across foreground classes. Qualitative assessment demonstrated feasible simultaneous segmentation of mitral leaflets, clip arms, grippers, and delivery shaft during TEER procedures. Conclusions: AutoClip represents a proof-of-concept TEER-specific TEE semantic segmentation framework initiated through a clinician-oriented workflow without formal computer science expertise. 
Although preliminary accuracy remains modest due to limited sample size, this study establishes a reproducible pathway for future AI-assisted intraprocedural guidance systems and larger multicenter development efforts in structural heart interventions.
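
For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how per-class IoU and Dice can be computed and averaged over foreground classes. It is not the authors' evaluation code: the function names, the flat-list mask representation, and the toy labels are all assumptions made for illustration, and classes absent from both masks are simply skipped.

```python
def iou_and_dice(pred, truth, label):
    """Return (IoU, Dice) for one class over two equally sized flat label lists.

    Returns None when the class is absent from both masks, so that it can be
    skipped when averaging (an assumption of this sketch).
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    # Pixels where both masks carry this label.
    intersection = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    pred_count = sum(1 for p in pred if p == label)
    truth_count = sum(1 for t in truth if t == label)
    union = pred_count + truth_count - intersection
    if union == 0:
        return None
    iou = intersection / union
    dice = 2 * intersection / (pred_count + truth_count)
    return iou, dice


def mean_foreground_scores(pred, truth, foreground_labels):
    """Average IoU and Dice over the given foreground labels."""
    scores = [iou_and_dice(pred, truth, label) for label in foreground_labels]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("no foreground class present in either mask")
    mean_iou = sum(s[0] for s in scores) / len(scores)
    mean_dice = sum(s[1] for s in scores) / len(scores)
    return mean_iou, mean_dice


if __name__ == "__main__":
    # Hypothetical toy masks: 0 is background, 1 and 2 are foreground classes.
    predicted = [0, 1, 1, 2, 2, 0]
    annotated = [0, 1, 0, 2, 2, 2]
    print(mean_foreground_scores(predicted, annotated, foreground_labels=[1, 2]))
```

Because Dice counts the intersection twice, it is never lower than IoU for the same class, which is worth keeping in mind when comparing figures reported with different metrics.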

15
An integrated proteogenomic investigation of the human liver uncovers molecular drivers of steatotic liver disease

Gobeil, E.; Bourgault, J.; Enault, M.; Cote, V.; Mitchell, P. L.; Ruel, L.-J.; Girard, A. S.; Vohl, M.-C.; Arsenault, B. J.

2026-06-06 endocrinology 10.64898/2026.06.04.26354903 medRxiv
Top 1%
0.0%

Metabolic dysfunction-associated steatotic liver disease (MASLD) is rapidly increasing worldwide, yet effective targeted therapies remain limited. To better understand the molecular mechanisms underlying MASLD, we performed an integrated proteogenomic analysis of human liver tissue. Using mass spectrometry, we quantified 2,744 proteins in 504 liver biopsies from the Quebec Obesity Biobank and examined changes across disease stages. To investigate causality, we integrated liver proteomics with RNA sequencing and genome-wide genotyping to map thousands of protein quantitative trait loci (pQTLs) and expression quantitative trait loci (eQTLs). These molecular data were combined with summary statistics from a meta-analysis of genome-wide association studies including 16,532 MASLD cases and 1,240,188 controls. Mendelian randomization and genetic colocalization analyses revealed that most proteins differentially expressed across MASLD stages were not causally implicated in disease risk, whereas several genetically predicted liver proteins showed evidence of causal effects. Among these, higher hepatic levels of the MTARC1 protein were causally associated with MASLD and hepatic fat accumulation. Phenome-wide analyses suggested that MTARC1 inhibition may reduce the risk of cirrhosis, hepatocellular carcinoma, and cholelithiasis while improving lipid profiles. Notably, the causal MTARC1 variant influenced liver protein levels but not gene expression. Genetic analyses also identified ERLIN1 and HSD17B13 as potential therapeutic targets. In contrast, eQTLs and pQTLs at other loci such as GCKR showed opposite effects on MASLD risk. These findings highlight the importance of integrating tissue proteomics with human genetics to distinguish biomarkers from causal drivers and to identify promising therapeutic targets for MASLD.

16
Serological thresholds of risk reduction for infant group B streptococcus disease

Cantrell, L.; Karampatsas, K.; Andrews, N.; Beach, S.; Bentley, E.; Berardi, A.; Bijlsma, M. W.; Cagil Kocana, C.; Daniel, O.; French, N.; Hall, T.; Izu, A.; Khalil, A.; Kwatra, G.; Kyohere, M.; Madhi, S. A.; Mboizi, R.; Miselli, F.; Nielsen, M.; Thorn, N.; van de Beek, D.; Walker, K.; Heath, P. T.; Le Doare, K.; Voysey, M.; PREPARE WP3 Study Group

2026-06-06 epidemiology 10.64898/2026.05.29.26353453 medRxiv
Top 1%
0.0%

Vaccines to prevent infant group B streptococcus (GBS) disease are advancing, with licensure likely based on safety and immunologic endpoints rather than clinical efficacy data. This approach requires robust, generalisable serological thresholds of risk reduction (SToRRs). We combined data from six case-control studies in Europe and Africa to define SToRRs for early-onset (EOD) and late-onset (LOD) GBS disease. Across diverse epidemiological and healthcare settings, anti-capsular polysaccharide IgG concentrations were consistently higher in infants who remained disease free than in those who developed disease. Higher antibody concentrations were required to reduce the risk of EOD than LOD, and higher concentrations were required for serotype Ia than for serotype III. This study provides a quantitative framework to support correlates-based evaluation and potential licensure of maternal GBS vaccines.

17
Direct and mediated effects (DME) SLCMA: a novel method for life course modelling with time-varying covariates

Beer, S.; Simpkin, A. J.; Eldeeb, S. Y.; Zar, H. J.; Stein, D. J.; Dunn, E. C.; Smith, A. D. A. C.

2026-06-06 epidemiology 10.64898/2026.05.29.26354427 medRxiv
Top 1%
0.0%

Background: In prospective cohort studies, where an exposure is collected repeatedly, interest often lies in determining whether the timing of that exposure has a differential effect on a later outcome. The Structured Life Course Modeling Approach (SLCMA), where users select between temporal hypotheses of exposure specified a priori, provides one way to analyse such longitudinal data. However, few studies using SLCMA consider the effect of time-varying covariates (TVC), which may impact associations. Methods: We present a modified version of the SLCMA - called direct and mediated effects (DME)-SLCMA - which corrects for TVC. We first develop the DME-SLCMA method, test it through simulation, and apply it to psychosocial data from the Drakenstein Child Health Study (DCHS, n=336) to investigate relationships between maternal psychopathology, TVC of socioeconomic status, and offspring depressive symptoms. Results: We found that, on average, offspring depressive symptom scores increased by 3.9% (95% CI: 1.0%-6.9%, p = 0.039) for each unit of maternal psychopathology (SRQ) at 48 months whilst adjusting for time-varying socioeconomic status (at 18, 30, 42 and 54 months). Our simulations identified several realistic scenarios where selections ignoring TVC - with TVC-mediated exposure effects present - were prone to be incorrect, including our DCHS example. Conclusion: DME-SLCMA is a robust new approach for life course modelling in the presence of time-varying covariates. We recommend adjusting for TVC whenever possible; when this is not possible, our simulation study indicates that scenarios where mediated effects are comparable to, or greater than, direct effects in magnitude are most prone to confounding.

18
Surfacing Suicidal Risk Through Simulated Social Interaction: Per-Person Language Model Agents as Communicative Stress Tests

Shao, W.; Ammerman, B.; Jacobucci, R.

2026-06-06 psychiatry and clinical psychology 10.64898/2026.06.04.26354928 medRxiv
Top 1%
0.0%

Suicidal risk may be encoded in everyday communication patterns but diluted in routine digital interactions. We introduce a method for surfacing this latent signal: training per-person language model agents on individuals' authored text (the on-screen text each participant typed, captured whenever a keyboard was visible in screenshots) and placing those agents in simulated social interactions, a communicative stress test. Using data from 79 adults with recent suicidal ideation, we fine-tuned individual LoRA adapters on Qwen3-8B using each participant's authored text, then placed agents in standardized conversations with probe personas. Agent-generated risk language was associated with EMA-measured suicidal ideation (r = .576, p < .001), with a single neutral small-talk probe performing nearly as well (r = .551). A shuffle control confirmed the signal is person-specific (r = .071 when adapters were mismatched), and automated descriptions of participants' general smartphone activity produced no signal, confirming specificity to interpersonal communication. A prompt ablation demonstrated partial robustness to removal of disclosure-encouraging language (r = .430). This proof-of-concept demonstrates that simulated social interaction can amplify latent vulnerability signals, bridging digital phenotyping, generative AI, and suicide theory.

19
Multimodal neuroimaging approach for cognitive impairment in Alzheimer disease

Gonzales, M.; Kang, X.; Adamson, M. M.; Chao, S. Z.; Yoon, B. C.

2026-06-06 radiology and imaging 10.64898/2026.06.04.26354924 medRxiv
Top 1%
0.0%

PURPOSE: Alzheimer disease (AD) is associated with cognitive impairment, brain atrophy, and elevated amyloid-beta and tau. The study aimed to characterize regional atrophy associated with elevated amyloid-beta and tau, as measured by [18F]florbetapir (FBP) and [18F]flortaucipir (FTP) positron emission tomography (PET), respectively, and determine whether combining PET and atrophy data improves the prediction of cognitive impairment. METHODS: Alzheimer Disease Neuroimaging Initiative data (n = 381) were retrospectively analyzed. PET results were correlated with cortical thickness, gray matter (GM) volumes, Mini-Mental State Examination, and Montreal Cognitive Assessment. Linear/logistic regression and area under the curve (AUC) were used to evaluate for significant correlations and compare performances in distinguishing cognitive impairment, respectively. RESULTS: Incremental loss of cortical thickness and GM volume was observed from FBP-/FTP- (n = 205) to single PET-positive (FBP+/FTP-, n = 133; FBP-/FTP+, n = 5) and FBP+/FTP+ (n = 38) groups, particularly in the temporal and parietal lobes. FBP+/FTP+ showed the most severe cortical thickness loss in the entorhinal cortex, temporal lobe GM atrophy, and cognitive impairment. Adding brain atrophy as the third variable resulted in higher odds ratios and improved AUCs for cognitive impairment, with FBP+/FTP+/temporal GM or entorhinal cortical atrophy+ demonstrating the strongest associations with cognitive impairment. CONCLUSION: A multimodal approach combining PET and MRI may help improve the assessment of cognitive impairment in AD.
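
As a reminder of what the AUC comparison in this abstract measures, here is a minimal sketch of AUC computed as the probability that a randomly chosen impaired case receives a higher model score than a randomly chosen unimpaired case, with ties counted as half. It is an illustration only: the function name, the 0/1 label coding, and the toy scores are assumptions, not the study's data or analysis code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` uses 1 for the positive class (e.g. impaired) and 0 for the
    negative class; this coding is an assumption of the sketch.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical predicted probabilities from two competing models.
    outcome = [1, 0, 1, 0]
    model_a = [0.9, 0.8, 0.4, 0.3]
    model_b = [0.9, 0.2, 0.8, 0.1]
    print(auc(model_a, outcome), auc(model_b, outcome))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a few hundred participants; rank-based implementations are the usual choice for larger samples.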

20
Dementia and Frailty Impact Postoperative Care Trajectories and Burden among Older Adults Undergoing Radical Cystectomy for Bladder Cancer

Ernandez, J.; Xiang, L.; Adler, R.; Hsu, J.; Shah, S. K.; Kim, D.; Gershman, B.; Mossanen, M.; Weissman, J. S.

2026-06-06 urology 10.64898/2026.06.04.26354768 medRxiv
Top 1%
0.0%

OBJECTIVE: Bladder cancer (BC) is predominantly a disease of older, comorbid adults, and radical cystectomy (RC), which is the gold standard treatment, carries considerable morbidity. We sought to determine the impact of baseline dementia and frailty on the care trajectory beyond the immediate postoperative period. We hypothesized that frail patients and those with dementia undergoing RC for BC will have poorer care trajectories. METHODS AND MATERIALS: We identified Medicare beneficiaries ≥ 66 years old who underwent RC for BC in 2017 with 12 months of pre- and post-RC enrollment. Frailty and dementia were characterized using validated, claims-based measures. Associations between baseline frailty and dementia with postoperative care trajectory outcomes were determined using Fine-Gray competing risk models. RESULTS: We identified 3,600 beneficiaries of whom 11.6% were frail and 3.4% met criteria for dementia. Patients with dementia were more likely to be frail, comorbid, and not receive standard-of-care neoadjuvant chemotherapy. Frailty was independently associated with ≥ 2 transitions in care level after index discharge from RC and skilled nursing facility (SNF) admissions within 1 year of RC, exposure to intensive post-RC interventions, including dialysis and feeding tube placement, and poorer survival. Dementia remained associated with SNF admissions regardless of frailty level. CONCLUSIONS: Among a contemporary cohort of older adults undergoing RC for BC, preoperative dementia and frailty were independently associated with poorer care trajectory beyond the immediate postoperative period after RC. Our work highlights a role for preoperative geriatric assessment in identifying and optimizing patients at greatest risk.